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Abstract 

In the usual forms of least squares nonlinear principal component analysis observed 
variables are quantified or transformed to optimize low-rank approximations. Thus 
NLPCA is linear PCA on optimally scaled variables. In this note we extend the approach 
by allowing for optimally scaled components. 
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1 Loss Function 

The problem studied in this paper is minimization of the least squares loss function 

a(X, B) — SSQ(Y - XB) (1) 

over X and B. Here SSQ() stands for the sum of squares, i.e. the square of the Frobenius 
norm. We use B_ for the matrix transpose of B. The (partial) unknowns are X, an n x p 
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matrix of components , and B, an mx p matrix of loadings. This is just least squares low-rank 
approximation. It is typically solved by directly using the singular value decomposition of Y, 
or by using some version of alternating least squares (iterate finding the optimal X for given 
B and the optimal B for given X). 

In order to at least partially identify X and B we require, without loss of generality, that 
XX = /, i.e. the components are orthonormal. This identifies X and B up to a rotation. All 
of this is completely standard. In this paper we introduce additional non-trivial constraints 
on the components. As an example, we could require for the first component 


Xn < X 21 < ■ ■ ■ < x n i, (2) 

but general partial orders and other cone or subspace constraints may also be useful. The 
important thing is that the constraints are imposed on each component seperately. 

We write for the set of all matrices satisfying the constraint for component s, which means 
we require X E 12, with 

= fli ® • • • ® Q p . 

This form of nonlinear principal component analysis is different from the more familiar form 
in which the columns of Y are transformed nonlinearly to optimize (1). See De Leeuw (2006) 
or De Leeuw (2014) for discussion and references. It is more closely related to the forms of 
constrained principal component analysis discussed in great detail by Takane (2014), and 
specifically to the form that allows separate constraints for separate dimensions (Takane, 
Kiers, and De Leeuw (1995)). In constrained principal component analysis, however, the 
emphasis is primarily on linear or subspace constraints on the components. 

2 Alternating Least Squares 

If we apply block relaxation to a least squares loss function we obtain an alternating least- 
squares algorithm (De Leeuw (1994), De Leeuw (2016a)). Alternating least squares algorithms 
as a general class of algorithms useful in data analysis were introduced by De Leeuw (1968), 
and then applied systematically in the ALSOS project, starting with De Leeuw, Young, and 
Takane (1976). 

0: Start with k = 0 and E fh 

1: B e argmin BgRmxP SSQ (Y — X^B) 

2: X (k+l '> E argmin Yen SSQ (Y - XB k ) 

3: If there is convergence, stop, otherwise k E- k + 1 and go back to step kl. 

Step 1 is a straightforward linear least squares problem, since we do not impose any constraints 
on B. Step 2 is more complicated, and we'll discuss it in detail in the next section. Convergence 
can be defined in terms of the decreasing sequence of loss function values, or in terms of 
changes in X and B from one iteration to the next. 
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3 Majorization 

Consider the problem of minimizing SSQ (Y — XB) over X £ 1C. Although the constraints are 
defined for each column of X separately, the matrix B combines columns and thus complicates 
the overall minimization problem. In order to solve the problem we use the majorization 
method, introduced by De Leeuw (1994), Heiser (1995), and Lange, Hunter, and Yang (2000), 
to separate the columns again. Majorization methods are discussed systematically in Lange 
(2016) and De Leeuw (2016a). In our overall algorithm this means we need an iterative 
majorization algorithm in step 2 of each cycle of the alternating least squares algorithm. 

For our majorization we choose a positive semi-definite diagonal matrix W such that W > ZLB, 
i.e. such that W — ZLB is positive semi-definite. We can use, for example, W = XI, with 
A the largest eigenvalue of B_B, or we can choose W — p diag(ZLB), with p the number of 
columns of X and B. Of course if E[B already is diagonal we can choose W = E[B. 

Suppose Z £ 1C, and in the majorization iteration we want improve Z. By defining 

U = Z + (Y - ZB)BW~ l 

and completing the square, we find the majorization inequality 

SSQ(Y - XB) < SSQ(Y - ZB) + tr (X - U)W (X - U ) - tr UWU, 
with equality if Z = X. 

In the majorization algorithm we iteratively minimize 

tr (X - U)W (X - U) = f) w s SSQ(z* s - u* s ) 

S =1 

over £ X s , which can obviously be done for each s separately. The update of column x* s 
of A" is the metric projection of column u* s of U on fl s . For constraint (2), for example, we 
apply isotone regression to w* s . 

4 Example: Linear Constraints 

Our first example is in the spirit of the DCDD method in Takane (2014). Each column of 
X is constrained linearly by x* s = G s a s . Thus must be in the subspace spanned by the 
columns of G s . The G s can be design matrices, indicator matrices, bases of polynomials or 
splines, and whatever else. 

Here is a simple example. 

gl <- matrix (0, 16, 4) 
gl [1: 4 , 1] <- 1 
gl [5 : 8 , 2] <- 1 
gl [9:12, 3] <- 1 
gl [13 : 16 , 4] <- 1 
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gl <- standardize (center (gl)) 

g2 <- rbind (diag (4), diag (4), diag (4), diag (4)) 
g2 <- standardize (center (g2)) 
g <- list (gl, g2) 
set.seed (12345) 

y <- standardize (center (mnorm(16, 5))) 

The linRes function approximates Y using the two constrained dimensions code in the list 
with matrices G\ and G 2 . Note that in the current implementation we only do a single 
majorization iteration to update the columns of A" before we update B. A more flexible 
strategy would be to allow for more than one inner majorization iterations. 

h <- linRes (y, g) 
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5 Example: Multiple Ordinal Variables 

In the nonlinear multivariate analysis system of Gift (1990) (see also Michailidis and De 
Leeuw (1998)) transformations can be nominal, ordinal, or numerical, and quantifications 
can be single or multiple. Multiple ordinal transformations have not really been implemented 
because their definition has never been entirely obvious. 

A definition that has been used before is to require that the first column x*i is isotone, while 
the remaining p— 1 columns are arbitrary, but orthogonal to the first. Thus we have X_X = / 
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with an isotone first column. It is easy to see that the orthonormality constraint is actually 
just for identification purposes, it does not enter into the majorization iterations. In fact, 
the algorithm in the funcion multOrd does not impose orthonormality. It uses the modified 
Gram-Schmidt method to transform to orthonormality after convergence, using the fact that 
modified Gram-Schmidt (without pivoting) merely normalizes the first column, which does 
not disturb the order relations. Clearly the same trick can be used for other cone constraints. 

It is interesting to observe that requiring isotonicity of the first column of X is actually 
equivalent to requiring that an isotone vector exists in the column space of X. Because if 
such a vector exists we can use the indeterminacy in the product XI3 to move that vector to 
the first column of A". 

For our numerical example we use the same Y as before. 

set.seed (12345) 

y <- standardize (center (mnorm(16, 5))) 

The program multOrd require the first column of X to be increasing. It uses Gram-Schmidt 
from De Leeuw (2015) and isotone regression from De Leeuw (2016b). 

h <- multOrd (y, 2) 
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fold 

2.0006170944 

fnew 

2.0006170936 

## 

itel 

118 

fold 

2.0006170936 

fnew 

2.0006170928 

## 

itel 

119 

fold 

2.0006170928 

fnew 

2.0006170922 

## 

itel 

120 

fold 

2.0006170922 

fnew 

2.0006170916 

## 

itel 

121 

fold 

2.0006170916 

fnew 

2.0006170911 

## 

itel 

122 

fold 

2.0006170911 

fnew 

2.0006170906 

## 

itel 

123 

fold 

2.0006170906 

fnew 

2.0006170903 

## 

itel 

124 

fold 

2.0006170903 

fnew 

2.0006170899 

## 

itel 

125 

fold 

2.0006170899 

fnew 

2.0006170896 

## 

itel 

126 

fold 

2.0006170896 

fnew 

2.0006170893 

## 

itel 

127 

fold 

2.0006170893 

fnew 

2.0006170891 

## 

itel 

128 

fold 

2.0006170891 

fnew 

2.0006170889 

## 

itel 

129 

fold 

2.0006170889 

fnew 

2.0006170887 

## 

itel 

130 

fold 

2.0006170887 

fnew 

2.0006170886 

## 

itel 

131 

fold 

2.0006170886 

fnew 

2.0006170884 

## 

itel 

132 

fold 

2.0006170884 

fnew 

2.0006170883 

## 

itel 

133 

fold 

2.0006170883 

fnew 

2.0006170882 

## 

itel 

134 

fold 

2.0006170882 

fnew 

2.0006170881 


6 Appendix: Weights 

In least squares majorization with non-diagonal positive definite weight matrix C we often 
want to find a diagonal matrix such that D > C and D is, in some sense, as small as possible 
(cf Groenen, Giaquinto, and Kiers (2003)). This problem is similar, but not identical, to 
minimum trace factor analysis (MTFA). In MTFA we find a diagonal D > 0 such that D < C 
and D is as large as possible. Typically “large” is defined as maximizing the trace or some 
other linear function of the diagonal elements. The majorization problem is different, not 
so much because “as large as possible” is replaced by “as small as possible”, but more so 
because D > C implies that D > 0. 

For our majorization we could also use the trace, so we want to minimize tr D over D > C. 
As in MTFA, this is a convex programming problem. It can be solved in many ways 
(see, for example, Jamshidian and Bentler (1998)), but each of them involves a non-trivial 
computational effort. In the body of the paper we have mentioned two more simple, and 
more easily computable, choices for D. The first is D — A (C)I, where A(C) is the largest 
eigenvalue of C. Of course if C is large computing the largest eigenvalue is not entirely 
trivial either. The second choice, which is actually used in the R programs in the appendix, 
is D = p diag(C), where p is the order of C and D. This is trivial to compute, but the 
corresponding D may not be very good (too large). It uses the trace as an upper bound for 
the largest eigenvalue of the correlation matrix corresponding with C, and there are much 
better bounds. 

This suggests using better bounds for the maximum eigenvalue, which are relatively easy to 
compute. We use the result that A(C) < ||C||, where ||C|| is any matrix norm. We could use 
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the Frobenius norm Vtr C 2 or the norm max^ =1 Ylt=\ l c st|- 
Let’s illustrate this with the 2x2 matrix 

2 -l' 

-1 3 

In the figure below all points in the convex area above the red branch from the hyperbola 
{x — 2) (y — 3) = 1 are D for which D > C. The trace of D is minimized at (3,4), where the 
blue line x + y = 7 is tangent to the hyperbola. This is the point labeled 1. Point labeled 2 
is p diag(C'), which is (4,6). The other four points are on the other blue line x = y. Point 
3 is at x — y — (5 + \/5)/2, which corresponds with the largest eigenvalue of C. Point 4 
has x — y — \/l5, which is the Frobenius bound, and point 5 has x — y — 4, which is the 
maximum row absolute sum norm. Finally point 6 uses the trace as the eigenvalue bound, 
which gives x = y = 5. 

par(pty="s") 

f <-function (x,y) (x-2)*(y-3)-l 
y<-seq(2 , 10 , length=100) 
x<-seq(2 , 10 , length=100) 
z<-outer(x,y ,f) 

contour (x=x,y=y,z=z, level=0 ,drawlabels=FALSE, lwd=3 , col="RED" ) 

lines(x,7-x,col="BLUE" ,lwd=2) 

lines(x, x, col = "BLUE", lwd = 2) 

lbd <- (5+sqrt(5))/2 

sbd = sqrt(15) 

points (lbd, lbd, col ="GREEN", cex = 2, pch = 19) 
points (4, 6, col ="GREEN", cex = 2, pch = 19) 

points (3, 4, col ="GREEN", cex = 2, pch = 19) 

points (4, 4, col = "GREEN", cex = 2, pch = 19) 

points (sbd, sbd, col = "GREEN", cex = 2, pch = 19) 

points (5, 5, col = "GREEN", cex = 2, pch = 19) 

text(3,4, labels = "1") 
text(4,6, labels = "2") 
text(lbd,lbd, labels = "3") 
text(sbd,sbd, labels = "4") 
text(4,4, labels = "5") 
text(5,5, labels = "6") 
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7 Appendix: Code 

7.1 linRes.R 

linRes <- 
function (y, 
g> 

bound = "M" , 
itmax = 1000, 
eps = le-10, 
verbose = TRUE, 
center = FALSE, 
standardize = FALSE) { 

if (center) { 
y <- center (y) 

> 

if (standardize) { 
y <- standardize (y) 

> 

p <- length (g) 

s <- 1 

xold <- NULL 

while (s <= p) { 

xold <- cbind (xold, g[[s]] l:ncol (g[[s]])) 

s <- s + 1 

> 

bold <- t (lm.fit (xold, y) $coeff icients) 
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rold <- y - tcrossprod (xold, bold) 
fold <- ssq (rold) 
itel <- 1 
repeat { 

bnew <- t (lm.fit (xold, y)$coefficients) 
cnew <- crossprod (bnew) 
if (bound == "M") { 

e <- max (rowSums (abs (cnew))) 
w <- diag (p) / e 

} 

if (bound == "E") { 

e <- max (eigen (cnew)$values) 
w <- diag (p) / e 

} 

if (bound == "F") { 

e <- sqrt (sum (cnew ~ 2)) 
w <- diag (p) / e 

} 

if (bound == "D") { 

w <- diag (1 / (p * diag (cnew))) 

} 

u <- xold + rold 0 / 0 * 0 / 0 bnew 0 / 0 * 0 / 0 w 
xnew <- NULL 
s <- 1 

while (s <= p) { 

xnew <- cbind (xnew, lm.fit (g[[s]], u[, s])$fitted.values) 
s <- s + 1 

} 

rnew <- y - tcrossprod (xnew, bnew) 
fnew <- ssq (rnew) 
if (verbose) { 
cat ( 

"itel ", 

formate (itel, width = 4, format = "d"), 

"fold ", 
formate ( 
fold, 

width = 15, 
digits = 10, 
format = "f" 

), 

"fnew ", 
formate ( 
fnew, 


12 


width = 15, 
digits = 10, 
format = "f" 

), 

"\n" 

) 

} 

if ((itel == itmax) II ((fold - fnew) < eps)) 
break 

itel <- itel + 1 
fold <- fnew 
xold <- xnew 
bold <- bnew 
rold <- rnew 

> 

return (list ( 
x = xnew, 
b = bnew, 
r = rnew, 
f = fnew 

)) 


7.2 multOrd.R 

source ("jbkPava.R") 
source ("gs.R") 

multOrd <- 
function (y, 

P> 

bound ="M", 
itmax = 1000, 
eps = le-10, 
verbose = TRUE, 
center = FALSE, 
standardize = FALSE) { 
set.seed (12345) 
if (center) { 
y <- center (y) 

> 

if (standardize) { 
y <- standardize (y) 

> 
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n <- nrow (y) 

xold <- standardize (center (cbind (l:n, mnorm (n, p - 1)))) 

bold <- t (lm.fit (xold, y) $coeff icients) 

rold <- y - tcrossprod (xold, bold) 

fold <- ssq (rold) 

itel <- 1 

repeat { 

bnew <- t (lm.fit (xold, y) $coeff icients) 
cnew <- crossprod (bnew) 
if (bound == "M") { 

e <- max (rowSums (abs (cnew))) 
w <- diag (p) / e 

} 

if (bound == "E") { 

e <- max (eigen (cnew) $values) 
w <- diag (p) / e 

> 

if (bound == "F") { 

e <- sqrt (sum (cnew 2)) 
w <- diag (p) / e 

} 

if (bound == "D") { 

w <- diag (1 / (p * diag (cnew))) 

} 

xnew <- xold + rold %*% bnew °/ 0 *°/o w 
xnew[, 1] <- jbkPava (xnew[, 1]) 
rnew <- y - tcrossprod (xnew, bnew) 
fnew <- ssq (rnew) 
if (verbose) { 
cat ( 

"itel ", 

formate (itel, width = 4, format = "d"), 

"fold ", 
formate ( 
fold, 

width = 15, 
digits = 10, 
format = "f" 

), 

"fnew ", 
formate ( 
fnew, 

width = 15, 
digits = 10, 
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f ormat 


), 

"\n" 

) 

} 

if ((itel == itmax) |I ((fold - fnew) < eps)) 
break 

itel <- itel + 1 
fold <- fnew 
xold <- xnew 
bold <- bnew 
rold <- rnew 

> 

z <- gsrc(xnew) 

return (list (x = z$q, b = tcrossprod (bnew, z$r) , r = rnew, f = fnew)) 

} 


7.3 auxilary. R 

mnorm <- function (n, p) { 

return (matrix (rnorm (n * p), n, p)) 

> 

center <- function (y) { 

return (apply (y, 2, function (x) 
x - mean (x))) 

} 

standardize <- function (y) { 

return (apply (y, 2, function (x) 
x / norm (x))) 

} 

ssq <- function (x) { 
return (sum (x ~ 2)) 

} 

norm <- function (x) { 
return (sqrt (ssq (x))) 

> 
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